Overview

Dataset statistics

Number of variables: 10
Number of observations: 20640
Missing cells: 207
Missing cells (%): 0.1%
Duplicate rows: 0
Duplicate rows (%): 0.0%
Total size in memory: 1.5 MiB
Average record size in memory: 76.0 B
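
The dataset-level missing percentage (0.1%) and the 1.0% reported for total_bedrooms describe the same 207 cells measured against different denominators. The short sketch below reproduces both figures from the counts in this report; the variable names are illustrative only.

```python
# Counts copied from the dataset statistics in this report.
n_rows = 20640
n_columns = 10
missing = 207  # every missing cell is in total_bedrooms

# Share of all cells in the table that are missing.
cell_pct = 100 * missing / (n_rows * n_columns)
# Share of the total_bedrooms column that is missing.
column_pct = 100 * missing / n_rows

print(f"{cell_pct:.1f}% of all cells, {column_pct:.1f}% of total_bedrooms")
```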

Variable types

NUM: 9
CAT: 1

Reproduction

Analysis started: 2020-06-09 16:04:50.936760
Analysis finished: 2020-06-09 16:05:40.006453
Duration: 49.07 seconds
Version: pandas-profiling v2.8.0
Command line: pandas_profiling --config_file config.yaml [YOUR_FILE.csv]
Download configuration: config.yaml

Warnings

latitude is highly correlated with longitude (High correlation)
longitude is highly correlated with latitude (High correlation)
total_bedrooms is highly correlated with total_rooms and 1 other field (High correlation)
total_rooms is highly correlated with total_bedrooms and 1 other field (High correlation)
households is highly correlated with total_rooms and 2 other fields (High correlation)
population is highly correlated with households (High correlation)
total_bedrooms has 207 (1.0%) missing values (Missing)

Variables

longitude
Real number (ℝ)

HIGH CORRELATION

Distinct count: 844
Unique (%): 4.1%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: -119.56970445736432
Minimum: -124.35
Maximum: -114.31
Zeros: 0
Zeros (%): 0.0%
Memory size: 161.2 KiB

Quantile statistics

Minimum: -124.35
5-th percentile: -122.47
Q1: -121.8
median: -118.49
Q3: -118.01
95-th percentile: -117.08
Maximum: -114.31
Range: 10.04
Interquartile range (IQR): 3.79

Descriptive statistics

Standard deviation: 2.003531724
Coefficient of variation (CV): -0.01675618195
Kurtosis: -1.330152366
Mean: -119.5697045
Median Absolute Deviation (MAD): 1.28
Skewness: -0.297801208
Sum: -2467918.7
Variance: 4.014139367
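
As a quick check on how these statistics relate to one another, the sketch below recomputes a few of them from the reported longitude values. It assumes the usual definitions (CV as the standard deviation divided by the mean, variance as the squared standard deviation, IQR as Q3 minus Q1); the variable names are illustrative only.

```python
# Values copied from the longitude statistics in this report.
mean = -119.5697045
std = 2.003531724
minimum, maximum = -124.35, -114.31
q1, q3 = -121.8, -118.01

cv = std / mean                    # negative only because the mean is negative
variance = std ** 2                # squared standard deviation
value_range = round(maximum - minimum, 2)
iqr = round(q3 - q1, 2)

print(f"CV={cv:.6f} variance={variance:.4f} range={value_range} IQR={iqr}")
```

The negative CV is therefore a consequence of the sign of the mean, not an indication of anything unusual in the spread of the data.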
Histogram with fixed size bins (bins=10)

Value                 Count    Frequency (%)
-118.31               162      0.8%
-118.3                160      0.8%
-118.29               148      0.7%
-118.27               144      0.7%
-118.32               142      0.7%
-118.28               141      0.7%
-118.35               140      0.7%
-118.36               138      0.7%
-118.19               135      0.7%
-118.25               128      0.6%
Other values (834)    19202    93.0%

Value                 Count    Frequency (%)
-124.35               1        < 0.1%
-124.3                2        < 0.1%
-124.27               1        < 0.1%
-124.26               1        < 0.1%
-124.25               1        < 0.1%

Value                 Count    Frequency (%)
-114.31               1        < 0.1%
-114.47               1        < 0.1%
-114.49               1        < 0.1%
-114.55               1        < 0.1%
-114.56               1        < 0.1%

latitude
Real number (ℝ≥0)

HIGH CORRELATION

Distinct count: 862
Unique (%): 4.2%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 35.63186143410853
Minimum: 32.54
Maximum: 41.95
Zeros: 0
Zeros (%): 0.0%
Memory size: 161.2 KiB

Quantile statistics

Minimum: 32.54
5-th percentile: 32.82
Q1: 33.93
median: 34.26
Q3: 37.71
95-th percentile: 38.96
Maximum: 41.95
Range: 9.41
Interquartile range (IQR): 3.78

Descriptive statistics

Standard deviation: 2.135952397
Coefficient of variation (CV): 0.05994501302
Kurtosis: -1.117759781
Mean: 35.63186143
Median Absolute Deviation (MAD): 1.23
Skewness: 0.4659530037
Sum: 735441.62
Variance: 4.562292644

Histogram with fixed size bins (bins=10)

Value                 Count    Frequency (%)
34.06                 244      1.2%
34.05                 236      1.1%
34.08                 234      1.1%
34.07                 231      1.1%
34.04                 221      1.1%
34.09                 212      1.0%
34.02                 208      1.0%
34.1                  203      1.0%
34.03                 193      0.9%
33.93                 181      0.9%
Other values (852)    18477    89.5%

Value                 Count    Frequency (%)
32.54                 1        < 0.1%
32.55                 3        < 0.1%
32.56                 10       < 0.1%
32.57                 18       0.1%
32.58                 26       0.1%

Value                 Count    Frequency (%)
41.95                 2        < 0.1%
41.92                 1        < 0.1%
41.88                 1        < 0.1%
41.86                 3        < 0.1%
41.84                 1        < 0.1%

housing_median_age
Real number (ℝ≥0)

Distinct count: 52
Unique (%): 0.3%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 28.639486434108527
Minimum: 1.0
Maximum: 52.0
Zeros: 0
Zeros (%): 0.0%
Memory size: 161.2 KiB

Quantile statistics

Minimum: 1
5-th percentile: 8
Q1: 18
median: 29
Q3: 37
95-th percentile: 52
Maximum: 52
Range: 51
Interquartile range (IQR): 19

Descriptive statistics

Standard deviation: 12.58555761
Coefficient of variation (CV): 0.4394477408
Kurtosis: -0.8006288536
Mean: 28.63948643
Median Absolute Deviation (MAD): 10
Skewness: 0.0603306376
Sum: 591119
Variance: 158.3962604

Histogram with fixed size bins (bins=10)

Value                 Count    Frequency (%)
52                    1273     6.2%
36                    862      4.2%
35                    824      4.0%
16                    771      3.7%
17                    698      3.4%
34                    689      3.3%
26                    619      3.0%
33                    615      3.0%
18                    570      2.8%
25                    566      2.7%
Other values (42)     13153    63.7%

Value                 Count    Frequency (%)
1                     4        < 0.1%
2                     58       0.3%
3                     62       0.3%
4                     191      0.9%
5                     244      1.2%

Value                 Count    Frequency (%)
52                    1273     6.2%
51                    48       0.2%
50                    136      0.7%
49                    134      0.6%
48                    177      0.9%

total_rooms
Real number (ℝ≥0)

HIGH CORRELATION

Distinct count: 5926
Unique (%): 28.7%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 2635.7630813953488
Minimum: 2.0
Maximum: 39320.0
Zeros: 0
Zeros (%): 0.0%
Memory size: 161.2 KiB

Quantile statistics

Minimum: 2
5-th percentile: 620.95
Q1: 1447.75
median: 2127
Q3: 3148
95-th percentile: 6213.2
Maximum: 39320
Range: 39318
Interquartile range (IQR): 1700.25

Descriptive statistics

Standard deviation: 2181.615252
Coefficient of variation (CV): 0.8276977802
Kurtosis: 32.630927
Mean: 2635.763081
Median Absolute Deviation (MAD): 797
Skewness: 4.147343451
Sum: 54402150
Variance: 4759445.106

Histogram with fixed size bins (bins=10)

Value                 Count    Frequency (%)
1527                  18       0.1%
1613                  17       0.1%
1582                  17       0.1%
2127                  16       0.1%
1703                  15       0.1%
1471                  15       0.1%
2053                  15       0.1%
1722                  15       0.1%
1607                  15       0.1%
1717                  15       0.1%
Other values (5916)   20482    99.2%

Value                 Count    Frequency (%)
2                     1        < 0.1%
6                     1        < 0.1%
8                     1        < 0.1%
11                    1        < 0.1%
12                    1        < 0.1%

Value                 Count    Frequency (%)
39320                 1        < 0.1%
37937                 1        < 0.1%
32627                 1        < 0.1%
32054                 1        < 0.1%
30450                 1        < 0.1%

total_bedrooms
Real number (ℝ≥0)

HIGH CORRELATION
MISSING

Distinct count: 1923
Unique (%): 9.4%
Missing: 207
Missing (%): 1.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 537.8705525375618
Minimum: 1.0
Maximum: 6445.0
Zeros: 0
Zeros (%): 0.0%
Memory size: 161.2 KiB

Quantile statistics

Minimum: 1
5-th percentile: 137
Q1: 296
median: 435
Q3: 647
95-th percentile: 1275.4
Maximum: 6445
Range: 6444
Interquartile range (IQR): 351

Descriptive statistics

Standard deviation: 421.3850701
Coefficient of variation (CV): 0.7834321252
Kurtosis: 21.98557506
Mean: 537.8705525
Median Absolute Deviation (MAD): 162
Skewness: 3.459546332
Sum: 10990309
Variance: 177565.3773

Histogram with fixed size bins (bins=10)

Value                 Count    Frequency (%)
280                   55       0.3%
331                   51       0.2%
345                   50       0.2%
393                   49       0.2%
343                   49       0.2%
394                   48       0.2%
328                   48       0.2%
348                   48       0.2%
272                   47       0.2%
309                   47       0.2%
Other values (1913)   19941    96.6%
(Missing)             207      1.0%

Value                 Count    Frequency (%)
1                     1        < 0.1%
2                     2        < 0.1%
3                     5        < 0.1%
4                     7        < 0.1%
5                     6        < 0.1%

Value                 Count    Frequency (%)
6445                  1        < 0.1%
6210                  1        < 0.1%
5471                  1        < 0.1%
5419                  1        < 0.1%
5290                  1        < 0.1%

population
Real number (ℝ≥0)

HIGH CORRELATION

Distinct count: 3888
Unique (%): 18.8%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 1425.4767441860465
Minimum: 3.0
Maximum: 35682.0
Zeros: 0
Zeros (%): 0.0%
Memory size: 161.2 KiB

Quantile statistics

Minimum: 3
5-th percentile: 348
Q1: 787
median: 1166
Q3: 1725
95-th percentile: 3288
Maximum: 35682
Range: 35679
Interquartile range (IQR): 938

Descriptive statistics

Standard deviation: 1132.462122
Coefficient of variation (CV): 0.7944444737
Kurtosis: 73.55311639
Mean: 1425.476744
Median Absolute Deviation (MAD): 440
Skewness: 4.935858227
Sum: 29421840
Variance: 1282470.457

Histogram with fixed size bins (bins=10)

Value                 Count    Frequency (%)
891                   25       0.1%
761                   24       0.1%
1227                  24       0.1%
850                   24       0.1%
1052                  24       0.1%
825                   23       0.1%
999                   22       0.1%
782                   22       0.1%
1005                  22       0.1%
781                   21       0.1%
Other values (3878)   20409    98.9%

Value                 Count    Frequency (%)
3                     1        < 0.1%
5                     1        < 0.1%
6                     1        < 0.1%
8                     4        < 0.1%
9                     2        < 0.1%

Value                 Count    Frequency (%)
35682                 1        < 0.1%
28566                 1        < 0.1%
16305                 1        < 0.1%
16122                 1        < 0.1%
15507                 1        < 0.1%

households
Real number (ℝ≥0)

HIGH CORRELATION

Distinct count: 1815
Unique (%): 8.8%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 499.5396802325581
Minimum: 1.0
Maximum: 6082.0
Zeros: 0
Zeros (%): 0.0%
Memory size: 161.2 KiB

Quantile statistics

Minimum: 1
5-th percentile: 125
Q1: 280
median: 409
Q3: 605
95-th percentile: 1162
Maximum: 6082
Range: 6081
Interquartile range (IQR): 325

Descriptive statistics

Standard deviation: 382.3297528
Coefficient of variation (CV): 0.7653641301
Kurtosis: 22.05798806
Mean: 499.5396802
Median Absolute Deviation (MAD): 151
Skewness: 3.410437712
Sum: 10310499
Variance: 146176.0399

Histogram with fixed size bins (bins=10)

Value                 Count    Frequency (%)
306                   57       0.3%
386                   56       0.3%
335                   56       0.3%
282                   55       0.3%
429                   54       0.3%
375                   53       0.3%
284                   51       0.2%
297                   51       0.2%
362                   50       0.2%
380                   50       0.2%
Other values (1805)   20107    97.4%

Value                 Count    Frequency (%)
1                     1        < 0.1%
2                     3        < 0.1%
3                     4        < 0.1%
4                     4        < 0.1%
5                     7        < 0.1%

Value                 Count    Frequency (%)
6082                  1        < 0.1%
5358                  1        < 0.1%
5189                  1        < 0.1%
5050                  1        < 0.1%
4930                  1        < 0.1%

median_income
Real number (ℝ≥0)

Distinct count: 12928
Unique (%): 62.6%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 3.8706710029069766
Minimum: 0.4999
Maximum: 15.0001
Zeros: 0
Zeros (%): 0.0%
Memory size: 161.2 KiB

Quantile statistics

Minimum: 0.4999
5-th percentile: 1.60057
Q1: 2.5634
median: 3.5348
Q3: 4.74325
95-th percentile: 7.300305
Maximum: 15.0001
Range: 14.5002
Interquartile range (IQR): 2.17985

Descriptive statistics

Standard deviation: 1.899821718
Coefficient of variation (CV): 0.4908249026
Kurtosis: 4.952524102
Mean: 3.870671003
Median Absolute Deviation (MAD): 1.0642
Skewness: 1.646656702
Sum: 79890.6495
Variance: 3.60932256

Histogram with fixed size bins (bins=10)

Value                 Count    Frequency (%)
3.125                 49       0.2%
15.0001               49       0.2%
2.875                 46       0.2%
4.125                 44       0.2%
2.625                 44       0.2%
3.875                 41       0.2%
3                     38       0.2%
3.375                 38       0.2%
3.625                 37       0.2%
4                     37       0.2%
Other values (12918)  20217    98.0%

Value                 Count    Frequency (%)
0.4999                12       0.1%
0.536                 10       < 0.1%
0.5495                1        < 0.1%
0.6433                1        < 0.1%
0.6775                1        < 0.1%

Value                 Count    Frequency (%)
15.0001               49       0.2%
15                    2        < 0.1%
14.9009               1        < 0.1%
14.5833               1        < 0.1%
14.4219               1        < 0.1%

median_house_value
Real number (ℝ≥0)

Distinct count: 3842
Unique (%): 18.6%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 206855.81690891474
Minimum: 14999.0
Maximum: 500001.0
Zeros: 0
Zeros (%): 0.0%
Memory size: 161.2 KiB

Quantile statistics

Minimum: 14999
5-th percentile: 66200
Q1: 119600
median: 179700
Q3: 264725
95-th percentile: 489810
Maximum: 500001
Range: 485002
Interquartile range (IQR): 145125

Descriptive statistics

Standard deviation: 115395.6159
Coefficient of variation (CV): 0.55785531
Kurtosis: 0.3278702429
Mean: 206855.8169
Median Absolute Deviation (MAD): 68400
Skewness: 0.9777632739
Sum: 4269504061
Variance: 1.331614816e+10

Histogram with fixed size bins (bins=10)

Value                 Count    Frequency (%)
500001                965      4.7%
137500                122      0.6%
162500                117      0.6%
112500                103      0.5%
187500                93       0.5%
225000                92       0.4%
350000                79       0.4%
87500                 78       0.4%
275000                65       0.3%
150000                64       0.3%
Other values (3832)   18862    91.4%

Value                 Count    Frequency (%)
14999                 4        < 0.1%
17500                 1        < 0.1%
22500                 4        < 0.1%
25000                 1        < 0.1%
26600                 1        < 0.1%

Value                 Count    Frequency (%)
500001                965      4.7%
500000                27       0.1%
499100                1        < 0.1%
499000                1        < 0.1%
498800                1        < 0.1%

ocean_proximity
Categorical

Distinct count: 5
Unique (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 80.6 KiB

Value                 Count    Frequency (%)
<1H OCEAN             9136     44.3%
INLAND                6551     31.7%
NEAR OCEAN            2658     12.9%
NEAR BAY              2290     11.1%
ISLAND                5        < 0.1%

Length

Max length: 10
Median length: 9
Mean length: 8.064922481
Min length: 6

Interactions

Correlations

Pearson's r

Pearson's correlation coefficient (r) is a measure of linear correlation between two variables. Its value lies between -1 and +1: -1 indicates total negative linear correlation, 0 indicates no linear correlation, and +1 indicates total positive linear correlation. Furthermore, r is invariant under separate changes in location and (positive) scale of the two variables, which implies that for a perfectly linear relationship the steepness of the line does not affect r.

To calculate r for two variables X and Y, one divides the covariance of X and Y by the product of their standard deviations.
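
The calculation described above can be sketched in plain Python. This is a minimal illustration, not the code pandas-profiling itself runs; the function name `pearson_r` is hypothetical, and the two short lists are the total_rooms and households values from the first five sample rows of this report.

```python
from math import sqrt

def pearson_r(x, y):
    """Covariance of x and y divided by the product of their standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    return cov / (std_x * std_y)

# First five sample rows: total_rooms and households.
total_rooms = [880.0, 7099.0, 1467.0, 1274.0, 1627.0]
households = [126.0, 1138.0, 177.0, 219.0, 259.0]

r = pearson_r(total_rooms, households)

# Shifting and positively rescaling one variable leaves r unchanged.
r_rescaled = pearson_r([2.0 * v + 3.0 for v in total_rooms], households)
print(r, r_rescaled)
```

Five rows are far too few to say anything about the full dataset; the point is only to show the arithmetic and the location/scale invariance.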

Spearman's ρ

Spearman's rank correlation coefficient (ρ) is a measure of monotonic correlation between two variables, and is therefore better at catching nonlinear monotonic correlations than Pearson's r. Its value lies between -1 and +1: -1 indicates total negative monotonic correlation, 0 indicates no monotonic correlation, and +1 indicates total positive monotonic correlation.

To calculate ρ for two variables X and Y, one divides the covariance of the rank variables of X and Y by the product of their standard deviations.
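
A minimal sketch of that calculation follows, assuming ties receive the average of the ranks they span (a common convention, not something this report states). The helper names `average_ranks`, `pearson_r` and `spearman_rho` are hypothetical, and the data is a made-up cubic relationship chosen because it is monotonic but not linear.

```python
from math import sqrt

def pearson_r(x, y):
    """Covariance divided by the product of the standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    return cov / (std_x * std_y)

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of equal values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Pearson's r applied to the rank variables of x and y."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Illustrative data: y = x**3 is monotonic but not linear.
x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]
print(spearman_rho(x, y), pearson_r(x, y))
```

On this data ρ reaches 1 (up to floating-point rounding) while Pearson's r stays below 1, which is the sense in which ρ is better at catching nonlinear monotonic relationships.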

Kendall's τ

Similar to Spearman's rank correlation coefficient, the Kendall rank correlation coefficient (τ) measures ordinal association between two variables. Its value lies between -1 and +1: -1 indicates total negative correlation, 0 indicates no correlation, and +1 indicates total positive correlation.

To calculate τ for two variables X and Y, one determines the number of concordant and discordant pairs of observations. τ is the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs.
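
The pair-counting definition above can be sketched directly. Taken literally it is the tau-a variant, which makes no adjustment for ties; libraries such as SciPy default to tau-b, which does, so values computed on tied data may differ from this sketch. The function name `kendall_tau_a` and the four-point example are illustrative only.

```python
from itertools import combinations

def kendall_tau_a(x, y):
    """(concordant pairs - discordant pairs) / total number of pairs."""
    concordant = 0
    discordant = 0
    pairs = list(combinations(range(len(x)), 2))
    for i, j in pairs:
        # Positive product: both variables move in the same direction.
        direction = (x[i] - x[j]) * (y[i] - y[j])
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
        # A product of zero is a tie and counts as neither.
    return (concordant - discordant) / len(pairs)

# Illustrative data: one swapped pair out of six.
x = [1, 2, 3, 4]
y = [1, 3, 2, 4]
tau = kendall_tau_a(x, y)
print(tau)
```

Here five of the six pairs are concordant and one is discordant, so τ = (5 - 1) / 6.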

Phik (φk)

Phik (φk) is a new and practical correlation coefficient that works consistently between categorical, ordinal and interval variables, captures non-linear dependency, and reverts to the Pearson correlation coefficient in the case of a bivariate normal input distribution. Extensive documentation is available for the phik package.

Missing values

Sample

First rows

index  longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population  households  median_income  median_house_value  ocean_proximity
0      -122.23    37.88     41.0                880.0        129.0           322.0       126.0       8.3252         452600.0            NEAR BAY
1      -122.22    37.86     21.0                7099.0       1106.0          2401.0      1138.0      8.3014         358500.0            NEAR BAY
2      -122.24    37.85     52.0                1467.0       190.0           496.0       177.0       7.2574         352100.0            NEAR BAY
3      -122.25    37.85     52.0                1274.0       235.0           558.0       219.0       5.6431         341300.0            NEAR BAY
4      -122.25    37.85     52.0                1627.0       280.0           565.0       259.0       3.8462         342200.0            NEAR BAY
5      -122.25    37.85     52.0                919.0        213.0           413.0       193.0       4.0368         269700.0            NEAR BAY
6      -122.25    37.84     52.0                2535.0       489.0           1094.0      514.0       3.6591         299200.0            NEAR BAY
7      -122.25    37.84     52.0                3104.0       687.0           1157.0      647.0       3.1200         241400.0            NEAR BAY
8      -122.26    37.84     42.0                2555.0       665.0           1206.0      595.0       2.0804         226700.0            NEAR BAY
9      -122.25    37.84     52.0                3549.0       707.0           1551.0      714.0       3.6912         261100.0            NEAR BAY

Last rows

index  longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population  households  median_income  median_house_value  ocean_proximity
20630  -121.32    39.29     11.0                2640.0       505.0           1257.0      445.0       3.5673         112000.0            INLAND
20631  -121.40    39.33     15.0                2655.0       493.0           1200.0      432.0       3.5179         107200.0            INLAND
20632  -121.45    39.26     15.0                2319.0       416.0           1047.0      385.0       3.1250         115600.0            INLAND
20633  -121.53    39.19     27.0                2080.0       412.0           1082.0      382.0       2.5495         98300.0             INLAND
20634  -121.56    39.27     28.0                2332.0       395.0           1041.0      344.0       3.7125         116800.0            INLAND
20635  -121.09    39.48     25.0                1665.0       374.0           845.0       330.0       1.5603         78100.0             INLAND
20636  -121.21    39.49     18.0                697.0        150.0           356.0       114.0       2.5568         77100.0             INLAND
20637  -121.22    39.43     17.0                2254.0       485.0           1007.0      433.0       1.7000         92300.0             INLAND
20638  -121.32    39.43     18.0                1860.0       409.0           741.0       349.0       1.8672         84700.0             INLAND
20639  -121.24    39.37     16.0                2785.0       616.0           1387.0      530.0       2.3886         89400.0             INLAND